Univariate Plots Section

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The alcohol content ranges from 8 to 14.2 percent. About 75% of the wines have a residual sugar value below 10 grams/liter (wines over 45 grams/liter are considered sweet). Some wines contain no citric acid. The mean quality is 5.878, and the maximum and minimum quality are 9 and 3 respectively.

The quality distribution appears unimodal and approximately normal, centered at 6.
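The console output above looks like the result of `dim()`, `names()`, `str()` and `summary()`. As a minimal sketch, the same inspection can be run on a toy data frame; the five rows below are copied from the `str()` output, and the object name `wines` is taken from the model calls later in this report. The real analysis would load the full file instead (its name is not given here).

```r
# Toy stand-in for the wine data: first five rows of three columns,
# copied from the str() output above (not the full dataset).
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

dim(wines)      # number of rows and columns
names(wines)    # column names
str(wines)      # type and first values of each column
summary(wines)  # min, quartiles, mean and max per column
```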

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Most alcohol values are around 9.5 percent, and the distribution is right-skewed (the mean, 10.51, sits above the median, 10.40).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Most sulphates values are around 0.45, and the distribution is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH follows an approximately normal distribution with a mean near 3.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density follows an approximately normal distribution with a mean of 0.994 and some outliers above 1.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide follows an approximately normal distribution with a mean of 138.4 and some outliers above 275. The middle 50% of the data lies between 108 and 167.
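The "middle 50%" is the interquartile range. As a worked example using only the quartiles printed above, the following sketch computes the IQR and the conventional 1.5 x IQR fences. This rule of thumb is an assumption added for illustration; the 275 threshold in the text appears to have been read from the plot, so the two cutoffs need not agree.

```r
# Quartiles of total.sulfur.dioxide, copied from the summary above.
q1 <- 108
q3 <- 167

iqr <- q3 - q1                  # width of the middle 50% of the data
lower_fence <- q1 - 1.5 * iqr   # values below this are flagged by the rule
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged by the rule

c(iqr = iqr, lower = lower_fence, upper = upper_fence)
```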

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide follows an approximately normal distribution with a mean near 35 and some outliers above 90. The middle 50% of the data lies between 23 and 46.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

About 50% of the chlorides values lie between 0.036 and 0.05 (the first and third quartiles).

The long-tailed data was transformed (log10) to better understand the distribution of residual sugar. The transformed residual sugar distribution appears bimodal, peaking around 1.5 and again around 10. This is an interesting plot.
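A minimal sketch of this kind of transform, using base R rather than whatever plotting package produced the original figure. The values in `sugar` are made up for illustration (chosen inside the observed range of 0.6 to 65.8); only the log10 step itself reflects the analysis.

```r
# Hypothetical residual sugar values in grams/liter (illustration only).
sugar <- c(0.6, 1.2, 1.5, 1.6, 1.7, 5.2, 9.9, 10.5, 12.0, 65.8)

# log10 compresses the long right tail so both clusters become visible.
log_sugar <- log10(sugar)

# Bin the transformed values; plot = FALSE returns the counts only.
h <- hist(log_sugar, breaks = seq(-0.5, 2, by = 0.25), plot = FALSE)
h$counts

# The same call with plotting enabled draws the histogram.
hist(log_sugar, breaks = seq(-0.5, 2, by = 0.25),
     main = "Residual sugar (log10 scale)", xlab = "log10(grams/liter)")
```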

Given the context of this feature, it is a good candidate for conversion into a new categorical (factor) variable. The dataset description says that it is rare to find wines with less than 1 gram/liter and that wines with more than 45 grams/liter are considered sweet, so the new variable could be:

##   RARE NORMAL  SWEET 
##     77   4820      1

Sadly, there aren’t enough cases in SWEET and RARE to make this feature useful.
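One way to build such a variable is `cut()`. The sketch below is an assumption about how the categories could be coded, not the original code: the name `sugar.level` is hypothetical, the sample values are made up, and with `right = FALSE` a wine at exactly 45 grams/liter would fall into SWEET.

```r
# Hypothetical residual sugar values in grams/liter (illustration only).
sugar <- c(0.6, 0.9, 1.0, 1.6, 20.7, 44.9, 65.8)

# Intervals are [0, 1), [1, 45) and [45, Inf) because right = FALSE.
sugar.level <- cut(sugar,
                   breaks = c(0, 1, 45, Inf),
                   labels = c("RARE", "NORMAL", "SWEET"),
                   right  = FALSE)

table(sugar.level)  # count of wines per category
```

A heavily unbalanced table, like the one printed above for the full dataset, is the signal that the new factor carries little usable information.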

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid follows an approximately normal distribution with a mean near 0.33 and some outliers above 0.9. The middle 50% of the data lies between 0.27 and 0.39.

Setting the binwidth to 0.01 reveals an unusually large number of values around 0.5 (at 0.49 exactly, with 215 wines).

## 
##  0.3 0.28 0.32 0.34 0.29 0.26 0.27 0.49 0.31 0.33 0.24 0.36 0.35 0.25 0.37 
##  307  282  257  225  223  219  216  215  200  183  181  177  137  136  134 
## 0.38  0.4 0.22 0.39 0.42 0.23 0.41  0.2 0.21 0.44 0.46 0.18 0.19 0.45 0.74 
##  122  117  104  101   95   83   82   70   66   63   51   49   48   46   41 
## 0.48 0.47 0.43  0.5 0.16 0.14 0.17 0.51 0.15 0.52 0.56 0.58    0 0.12 0.54 
##   39   38   37   35   33   27   27   25   23   23   22   21   19   19   19 
## 0.13 0.53  0.1 0.62 0.57 0.04 0.07 0.09 0.55 0.61 0.71 0.65 0.01 0.66 0.67 
##   17   16   14   14   13   12   12   12   11    9    9    8    7    7    7 
## 0.68 0.02 0.06 0.59  0.6 0.64 0.05 0.69 0.72 0.73    1 0.08 0.63  0.7 0.03 
##    7    6    6    6    6    6    5    5    5    5    5    4    4    3    2 
## 0.78 0.79  0.8 0.81 0.82 0.91 0.11 0.86 0.88 0.99 1.23 1.66 
##    2    2    2    2    2    2    1    1    1    1    1    1
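A frequency table sorted by count, like the one above, is a quick way to spot a spike at a single value. The sketch below shows the idea on a few made-up citric acid values.

```r
# Hypothetical citric acid values (illustration only).
citric <- c(0.30, 0.49, 0.28, 0.49, 0.30, 0.49, 0.32)

# Count each distinct value, then put the most frequent first.
counts <- sort(table(citric), decreasing = TRUE)
counts

names(counts)[1]  # the most common value, as a string
```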

The long-tailed data was transformed to better understand the distribution of volatile acidity. The transformed volatile acidity distribution appears normal, with the acidity peaking around 0.25.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity follows an approximately normal distribution with a mean near 6.85 and some outliers above 10 and below 4. The middle 50% of the data lies between 6.3 and 7.3.

A new feature can be derived from the acidities. Some references say that total acidity is the fixed acidity plus the volatile acidity. However, the fixed acidity measure here should be read (for easier understanding) as tartaric acid only, not all the non-volatile acids, so our total acidity is the sum of all the acid features (fixed, volatile and citric).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960
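A minimal sketch of the derived feature, assuming it is the plain sum of the three acid columns. That reading is consistent with the summaries above, since the three means add up to the reported mean of 7.467. The rows are the first three wines from the `str()` output.

```r
# First three wines, copied from the str() output earlier in this section.
wines <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# Total acidity as the sum of all acid columns (assumed definition).
wines$total.acidity <- wines$fixed.acidity +
  wines$volatile.acidity +
  wines$citric.acid

wines$total.acidity

# Sanity check on the reported means: 6.855 + 0.2782 + 0.3342
mean_check <- 6.855 + 0.2782 + 0.3342
mean_check
```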

Univariate Analysis

What is the structure of your dataset?

There are 4898 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).

All the features are numerical, even quality, which is based on a score from 0 to 10. This feature is the easiest one to convert to a factor for easier plot interpretation, but we are going to maintain both versions.
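Keeping both versions can be sketched as follows. The name `quality.factor` is hypothetical, the scores are made up, and the levels 3 to 9 are the range actually observed in the summary rather than the full 0 to 10 scale.

```r
# Hypothetical quality scores (illustration only).
quality <- c(6L, 5L, 7L, 3L, 9L, 6L)

# Ordered factor: keeps the ranking while treating scores as categories.
quality.factor <- factor(quality, levels = 3:9, ordered = TRUE)

table(quality.factor)                  # counts per level, including empty ones
quality.factor[4] < quality.factor[5]  # comparisons still work: 3 < 9
```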

Main thoughts:

Judging from the univariate plots, most of the features follow approximately normal distributions with little variability but some outliers.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. The main idea is to try to predict the quality of a wine. To that end, let's see how quality behaves against some other features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I’ve learned from reviewing some internet articles that good wine quality is given by this balance: Sweet Taste (sugars + alcohols) <= => Acid Taste (acids) + Bitter Taste (phenols). In white wines, the concentration of phenols (tannins, which give the red color of the wine) is insignificant. So the interesting features for this analysis will be: fixed.acidity, volatile.acidity, citric.acid, total.acidity, residual.sugar, alcohol and quality.

Did you create any new variables from existing variables in the dataset?

Total acidity is a combination of all the acids. Residual sugar was also categorized, but the small number of wines in some categories made that variable useless.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Residual sugar is perhaps the most unusual distribution: after applying a log10 transform for a better understanding of the data, a bimodal distribution appears. The rest seem to be normal distributions, some of them right-skewed.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
## total.acidity           0.98717874       0.07157062  0.394143356
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## total.acidity            0.10473749  0.04552987       -0.0451333172
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## total.acidity                 0.113188502  0.27560881 -0.4306513315
##                        sulphates     alcohol      quality total.acidity
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831    0.98717874
## volatile.acidity     -0.03572815  0.06771794 -0.194722969    0.07157062
## citric.acid           0.06233094 -0.07572873 -0.009209091    0.39414336
## residual.sugar       -0.02666437 -0.45063122 -0.097576829    0.10473749
## chlorides             0.01676288 -0.36018871 -0.209934411    0.04552987
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067   -0.04513332
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218    0.11318850
## density               0.07449315 -0.78013762 -0.307123313    0.27560881
## pH                    0.15595150  0.12143210  0.099427246   -0.43065133
## sulphates             1.00000000 -0.01743277  0.053677877   -0.01185225
## alcohol              -0.01743277  1.00000000  0.435574715   -0.11751272
## quality               0.05367788  0.43557472  1.000000000   -0.13137721
## total.acidity        -0.01185225 -0.11751272 -0.131377207    1.00000000

Some unexpected correlations appear among the features. Density seems to be correlated with residual sugar and with alcohol, so let's include these in the features under investigation.

##                residual.sugar    density    alcohol     quality
## residual.sugar     1.00000000  0.8389665 -0.4506312 -0.09757683
## density            0.83896645  1.0000000 -0.7801376 -0.30712331
## alcohol           -0.45063122 -0.7801376  1.0000000  0.43557472
## quality           -0.09757683 -0.3071233  0.4355747  1.00000000
## total.acidity      0.10473749  0.2756088 -0.1175127 -0.13137721
##                total.acidity
## residual.sugar     0.1047375
## density            0.2756088
## alcohol           -0.1175127
## quality           -0.1313772
## total.acidity      1.0000000
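Correlation tables like these come from `cor()` applied to a numeric data frame or a subset of its columns. The sketch below uses the first five wines as displayed (rounded) in the `str()` output, so its coefficients will not match the full-data values above.

```r
# First five wines, copied from the (rounded) str() output.
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation between every pair of the selected columns.
m <- cor(wines[, c("residual.sugar", "density", "alcohol")])
round(m, 2)
```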

The main objective is to know how these features affect wine quality, but first let's see how the other features are related.

There is a very strong relationship between residual sugar and the density of the wine. In fact, the correlation is 0.84.

We also see a strong relationship between density and alcohol, as anticipated by the correlation of -0.78.

No relationship can be seen between total acidity and quality; in fact, the correlation for this pair is -0.13. The linear model for these features (blue line) is almost horizontal. This means the slope (the total acidity coefficient) is close to zero, so total acidity carries very little weight in this equation.
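The link between a weak correlation and a nearly flat fitted line can be checked directly: in a one-predictor linear model, the slope equals the correlation times `sd(y) / sd(x)`. The sketch below uses simulated data (an assumption, not the wine data) to show that identity.

```r
set.seed(1)

# Simulated stand-ins for the two features (illustration only).
total.acidity <- rnorm(200, mean = 7.5, sd = 0.85)
quality <- round(6 - 0.1 * (total.acidity - 7.5) + rnorm(200, sd = 0.9))

fit   <- lm(quality ~ total.acidity)
slope <- coef(fit)[["total.acidity"]]
r     <- cor(total.acidity, quality)

# The fitted slope is the correlation rescaled by the two spreads,
# so a correlation near zero gives a line that is close to horizontal.
slope
r * sd(quality) / sd(total.acidity)
```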

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.645   7.102   7.935   8.269   9.163  12.410 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.570   7.020   7.590   7.815   8.320  11.520 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   6.970   7.500   7.574   8.120  11.030 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.860   7.370   7.436   7.940  14.960 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.730   6.810   7.310   7.323   7.820   9.870 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.525   6.810   7.370   7.261   7.760   8.930 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.250   7.600   7.850   8.104   8.000   9.820
## quality: 3
## [1] 20
## -------------------------------------------------------- 
## quality: 4
## [1] 163
## -------------------------------------------------------- 
## quality: 5
## [1] 1457
## -------------------------------------------------------- 
## quality: 6
## [1] 2198
## -------------------------------------------------------- 
## quality: 7
## [1] 880
## -------------------------------------------------------- 
## quality: 8
## [1] 175
## -------------------------------------------------------- 
## quality: 9
## [1] 5
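The per-quality blocks above have the layout printed by `by()`. A minimal sketch on a made-up data frame follows; `tapply()` is added as a compact alternative for the counts.

```r
# Hypothetical wines (illustration only).
wines <- data.frame(
  quality       = c(5L, 5L, 6L, 6L, 6L, 7L),
  total.acidity = c(7.5, 8.1, 7.4, 6.9, 7.9, 7.3)
)

# Five-number summary plus mean of total acidity within each quality level.
by(wines$total.acidity, wines$quality, summary)

# Number of wines in each quality level.
by(wines$total.acidity, wines$quality, length)

# Same counts as a plain named vector.
counts <- tapply(wines$total.acidity, wines$quality, length)
counts
```

Checking the group sizes matters here: the real data has only 20 wines at quality 3 and 5 at quality 9, so the summaries for those levels rest on very few observations.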

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.588   4.600   6.392  10.700  16.200 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

Again, there is no relationship between residual sugar and quality, as expected, and the linear model for these features is almost a horizontal line.

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## quality: 3
## [1] 20
## -------------------------------------------------------- 
## quality: 4
## [1] 163
## -------------------------------------------------------- 
## quality: 5
## [1] 1457
## -------------------------------------------------------- 
## quality: 6
## [1] 2198
## -------------------------------------------------------- 
## quality: 7
## [1] 880
## -------------------------------------------------------- 
## quality: 8
## [1] 175
## -------------------------------------------------------- 
## quality: 9
## [1] 5

Here a small relationship can be seen. For this dataset, the quality of the wine seems to increase with the alcohol content.

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970
## quality: 3
## [1] 20
## -------------------------------------------------------- 
## quality: 4
## [1] 163
## -------------------------------------------------------- 
## quality: 5
## [1] 1457
## -------------------------------------------------------- 
## quality: 6
## [1] 2198
## -------------------------------------------------------- 
## quality: 7
## [1] 880
## -------------------------------------------------------- 
## quality: 8
## [1] 175
## -------------------------------------------------------- 
## quality: 9
## [1] 5

In this case, there seems to be a small relationship between density and quality, but the correlation is low (-0.31), so it is not important.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main relationships in this bivariate analysis involve the alcohol feature. It has a strong relationship with density (-0.78) and a moderate one with residual sugar (-0.45).

But no single remarkable relationship could be found with quality. None of the features analyzed is strongly related to quality on its own; the largest correlation is with alcohol (0.44). This is to be expected, because achieving good wine quality is not that easy, is it?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The most interesting relationships involve the density feature. In fact, looking at the correlations between features, density almost always has the highest values.

What was the strongest relationship you found?

The strongest relationship is between density and residual sugar, with a correlation of 0.84. Density and alcohol (-0.78) are also strongly correlated.

Multivariate Plots Section

Here we can see that for higher quality values the density vs alcohol points seem to sit in the top left of the graph, while for lower quality values they fall on the left side (always following the linear model, which looks similar for every wine quality).

These plots show that the correlation exists for every quality level, and the values seem to move from right to left along the linear model as quality increases.

Now let's see the behaviour of quality against the other features of interest.

These plots show how difficult it is to obtain a good quality wine. Very few relationships can be found. The next section gives some thoughts about why this happens.
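The claim that the linear model looks similar for every quality can be checked numerically by fitting one model per quality level and comparing slopes. The sketch below does this on simulated data (an assumption; the generating slope of -0.0015 is arbitrary), so it shows the technique rather than the report's actual numbers.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the wine data (illustration only).
alcohol <- runif(n, 8, 14)
quality <- sample(5:7, n, replace = TRUE)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
wines   <- data.frame(alcohol, density, quality)

# Fit density ~ alcohol separately within each quality level
# and keep only the alcohol slope of each fit.
slopes <- sapply(split(wines, wines$quality), function(d) {
  coef(lm(density ~ alcohol, data = d))[["alcohol"]]
})

slopes  # one slope per quality level
```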

## 
## Calls:
## m1: lm(formula = quality ~ total.acidity, data = wines)
## m2: lm(formula = quality ~ total.acidity + log10(residual.sugar), 
##     data = wines)
## m3: lm(formula = quality ~ total.acidity + log10(residual.sugar) + 
##     alcohol, data = wines)
## 
## ==========================================================
##                             m1         m2         m3      
## ----------------------------------------------------------
##   (Intercept)             6.856***   6.899***   2.730***  
##                          (0.106)    (0.107)    (0.155)    
##   total.acidity          -0.131***  -0.126***  -0.086***  
##                          (0.014)    (0.014)    (0.013)    
##   log10(residual.sugar)             -0.119***   0.288***  
##                                     (0.031)    (0.031)    
##   alcohol                                       0.343***  
##                                                (0.010)    
## ----------------------------------------------------------
##   R-squared                   0.0        0.0        0.2   
##   adj. R-squared              0.0        0.0        0.2   
##   sigma                       0.9        0.9        0.8   
##   F                          86.0       50.3      435.2   
##   p                           0.0        0.0        0.0   
##   Log-likelihood          -6312.0    -6304.8    -5775.5   
##   Deviance                 3774.7     3763.7     3032.2   
##   AIC                     12630.0    12617.6    11561.1   
##   BIC                     12649.4    12643.6    11593.6   
##   N                        4898       4898       4898     
## ==========================================================
## 
## Call:
## lm(formula = quality ~ total.acidity + log10(residual.sugar) + 
##     alcohol, data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5799 -0.5287 -0.0104  0.4770  3.2541 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.729844   0.154573  17.661  < 2e-16 ***
## total.acidity         -0.086298   0.012768  -6.759 1.55e-11 ***
## log10(residual.sugar)  0.288359   0.030592   9.426  < 2e-16 ***
## alcohol                0.343059   0.009984  34.361  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7871 on 4894 degrees of freedom
## Multiple R-squared:  0.2106, Adjusted R-squared:  0.2101 
## F-statistic: 435.2 on 3 and 4894 DF,  p-value: < 2.2e-16
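The three nested models can be sketched in base R as follows. The data here is simulated (an assumption), with coefficients loosely rounded from the m3 output above, so the printed numbers will differ from the table; the point is the structure of the calls and the comparison of R-squared across the nested fits.

```r
set.seed(7)
n <- 500

# Simulated stand-in for the wine data (illustration only).
wines <- data.frame(
  total.acidity  = rnorm(n, mean = 7.5, sd = 0.85),
  residual.sugar = exp(rnorm(n, mean = 1.5, sd = 0.9)),  # strictly positive
  alcohol        = runif(n, 8, 14)
)
wines$quality <- with(wines,
  2.7 - 0.09 * total.acidity + 0.29 * log10(residual.sugar) +
    0.34 * alcohol + rnorm(n, sd = 0.8))

# Each model adds one predictor to the previous one.
m1 <- lm(quality ~ total.acidity, data = wines)
m2 <- update(m1, . ~ . + log10(residual.sugar))
m3 <- update(m2, . ~ . + alcohol)

# R-squared of each fit; it cannot decrease as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 3)

summary(m3)  # full coefficient table for the largest model
```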

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As we saw in the bivariate section, density is strongly correlated with residual sugar and alcohol, and this holds for every wine quality.

Furthermore, a small relationship appears when combining total acidity with residual sugar and alcohol. The linear model has an R-squared value of 0.2, meaning that about 20% of the variance in quality is accounted for.
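Two quick consistency checks can be done with numbers already printed in this report. For a one-predictor model, R-squared is the squared correlation, which explains why m1 shows as 0.0 in the rounded table. The residual standard error of m3 can also be recovered from its deviance and degrees of freedom.

```r
# Correlation between quality and total.acidity, from the matrix above.
r_m1  <- -0.131377207
r2_m1 <- r_m1^2   # R-squared of m1; rounds to 0.0 at one decimal
r2_m1

# Deviance (residual sum of squares) and sample size of m3, from the table.
deviance_m3 <- 3032.2
df_m3       <- 4898 - 4   # observations minus four estimated coefficients

# Residual standard error: sqrt(RSS / residual degrees of freedom).
sigma_m3 <- sqrt(deviance_m3 / df_m3)
sigma_m3
```

The second value agrees with the residual standard error of 0.7871 on 4894 degrees of freedom reported in the model summary.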

Were there any interesting or surprising interactions between features?

As said before, the most interesting feature is density, analyzed together with alcohol and residual sugar. No special interaction could be seen in this section.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

In order to predict wine quality I created a linear model to test whether the balance of alcohol + sugar against total acidity gives a good wine quality. The model doesn’t seem to be very accurate (R-squared of 0.2), even though the features used are significant in the model (three stars in the m3 column).


Final Plots and Summary

Plot One

Description One

The distribution of residual sugar appears to be bimodal. This is not easy to explain; it may reflect demand for wines with clearly differentiated levels of sweetness.

Plot Two

Description Two

Residual sugar is one of the most interesting features because of its high correlation with others. In this case the relationship with density is almost linear, but checking whether quality influences this relationship doesn’t give us any clue.

Plot Three

Description Three

Quality levels have a small (lower than expected) relation with total acidity and the alcohol + residual sugar combination. Higher quality wines seem to have less acidity with higher alcohol and residual sugar values.

Reflection

The white wines data set contains information on almost 5000 wines. First, an exploratory data analysis was performed to understand the features, along with some internet research to contextualize and learn about the topic. This gave me some references about how quality could be calculated or predicted from some of the features already provided in the dataset. Before this, some relations caught my attention, such as the strong relationship of density with other features like alcohol and residual sugar. Finally, trying to find relations that determine good quality was quite frustrating. Some internet research directed me to this formula: Sweet Taste (sugars + alcohols) <= => Acid Taste (acids). But reaching a final conclusion wasn’t as easy as it seemed. I could find a small relationship between these features, but in the resulting linear model only a small share of the variance in quality is accounted for (20%).

One conclusion I can draw is that the data set lacks a wider spread of quality values. Almost all the wines are ‘NORMAL’ and it is difficult to cluster them. I also think that my analysis was somewhat biased by trying to predict quality from the previous formula.

Bibliography

Some thoughts

The higher the sugar, the higher the alcohol. Sweet Taste (sugars + alcohols) <= => Acid Taste (acids) + Bitter Taste (phenols) (tannins, red wines only). Increasing fixed acidity decreases the pH. Increasing citric acid increases the pH.